In this paper, we will analyze the IMDB data to answer some questions, design a measurement framework for evaluating performance and identify an opportunity for investment.
Data Prep
In this section, we obtain and prepare data for analysis. Because of the memory and performance issues, we will use small files and further down-select data to enable a more fluid analysis. We will drop titles with fewer than 100 ratings and individuals who worked on only 1 title. Installing and Loading Libraries
## read files in and create dataframesname_basics <-read.csv("name_basics_small.csv")title_basics <-read.csv("title_basics_small.csv")title_episodes <-read.csv("title_episodes_small.csv")title_ratings <-read.csv("title_ratings_small.csv")title_crew <-read.csv("title_crew_small.csv")title_principals <-read.csv("title_principals_small.csv")# drop records with fewer than 2 titles from name_basics dfname_basics <- name_basics |>filter(str_count(knownForTitles, ",") >1)# drop records with fewer than 100 ratings from title_ratings dftitle_ratings <- title_ratings |>filter(numVotes >=100)
Furthermore, to ensure consistency across all data sets, we will apply the same filtering, i.e., excluding titles with fewer than 100 ratings, to the rest of the title tables:
Show the code
# filtering title basics dftitle_basics <- title_basics |>semi_join( title_ratings,join_by(tconst == tconst) )# filtering title crew dftitle_crew <- title_crew |>semi_join( title_ratings,join_by(tconst == tconst) )# filtering title episodes df on title idtitle_episodes_1 <- title_episodes |>semi_join( title_ratings,join_by(tconst == tconst) )# filtering title episodes df on parent title idtitle_episodes_2 <- title_episodes |>semi_join( title_ratings,join_by(parentTconst == tconst) )# combining filtered title episodes dfstitle_episodes <-bind_rows( title_episodes_1, title_episodes_2) |>distinct()## filtering title principals dftitle_principals <- title_principals |>semi_join( title_ratings,join_by(tconst == tconst) )# remove dfs we no longer needrm(title_episodes_1)rm(title_episodes_2)
Task 1
Correct the column types of the title tables using a combination of mutate and the coercion functions as.numeric and as.logical.
Q1. How many movies are in our data set? How many TV series? How many TV episodes?
To answer this question, we will use the title basics data set, which contains release and production information.
Content type is captured in the titleType column. We have 131,662 movies, 29,789 TV Series and 155,722 TV episodes.
# get a count of records by content typesdf1 <- title_basics |>group_by(titleType) |>summarize(number_of_records =n()) |>ungroup() |>mutate(number_of_records =comma(number_of_records, digits =0)) |>rename(title_type = titleType) |>arrange(desc(number_of_records))# plot the resulting dffig_content_count_type <-plot_ly(data = df1,y =~reorder(title_type, number_of_records),x =~number_of_records,type ="bar",orientation ="h",marker =list(color ="cerulean"),width =500,height =300)fig_content_count_type <- fig_content_count_type |>layout(title ="Number of Titles by Content Type",xaxis =list(title ="Number of Records"),yaxis =list(title ="") )fig_content_count_type
(Please note it’s an interactive chart - hover over it for interaction options.)
Q2. Who is the oldest living person in our data set?
To answer this question, we will use the name basics table, which has birth and death records. However, a quick examination of data highlights certain irregularities in death records. It appears that we are missing actual death records for a number of individuals who, despite being born prior to the 20th century, are appear to still be alive.
# list living persons by year of birth# Subset of data - 10 oldest presumably living persons name_basics |>filter(is.na(death_year) &!is.na(birth_year)) |>arrange(birth_year) |>head(10) |>gt()
nconst
primaryName
birth_year
death_year
primaryProfession
knownForTitles
nm5671597
Robert De Visée
1655
NA
composer,soundtrack
tt2219674,tt1743724,tt0441074,tt14426058
nm7807390
William Sandys
1767
NA
composer,soundtrack
tt4396584,tt3747572,tt4555594,tt0071007
nm1441282
Richard Dybeck
1811
NA
soundtrack
tt0021783,tt0022126,tt0036372,tt0037562
nm6711738
Albert Monnier
1815
NA
writer
tt0329972,tt3966780,tt6793558,tt15175930
nm1227803
C. Hostrup
1818
NA
writer,composer,actor
tt0031361,tt0134089,tt0844680,tt14463014
nm1329526
Edouard Martin
1825
NA
writer
tt0200268,tt0329972,tt3966780,tt0036496
nm1197286
Ion Ivanovici
1845
NA
composer,soundtrack
tt0043412,tt0040391,tt1324061,tt0083697
nm0179107
Attilio Corbell
1850
NA
actor
tt0009508,tt0009121,tt0182770,tt0007472
nm0843185
André Sylvane
1850
NA
writer
tt0019480,tt0155273,tt0159028,tt0167460
nm0242243
Charles Dungan
1853
NA
actor
tt0267008,tt0008259,tt0008876,tt0003634
# create a df with records of living personsdf3 <- name_basics |>filter(is.na(death_year) &!is.na(birth_year)) |>group_by(birth_year) |>summarise(number_of_records =n()) |>ungroup() |>arrange(birth_year)# plot the resulting dffig_cnt_living_persons <-plot_ly(data = df3,x =~birth_year,y =~number_of_records,type ="bar",marker =list(color ="cerulean"),width =500,height =300)fig_cnt_living_persons <- fig_cnt_living_persons |>layout(title ="Living Persons by Year of Birth",xaxis =list(title ="Year of Birth"),yaxis =list(title ="Count of Living Persons") )fig_cnt_living_persons
(Please note it’s an interactive chart - hover over it for interaction options.)
Since we can’t manually verify verify hundreds of questionable records, we will have to use a rule-based approach to answer this question. The oldest verified person to have ever lived was 122 years and 164 days at the time of death so using this age as a threshold, we can filter out all individuals born after 1902, which leaves us with 65 individuals born in 1903.
# list count of living persons by year of birthname_basics |>filter(is.na(death_year) &!is.na(birth_year) & birth_year >1902) |>group_by(birth_year) |>summarize(number_of_records =n()) |>ungroup() |>arrange(birth_year) |>head(5) |>gt()
birth_year
number_of_records
1903
65
1904
77
1905
68
1906
83
1907
78
# list living persons born in 1903df4 <- name_basics |>filter(birth_year ==1903&is.na(death_year)) |>select(primaryName, birth_year, death_year) |>arrange(primaryName)sample_n(df4, 65) |> DT::datatable()
Q3. There is one TV Episode in this data set with a perfect 10/10 rating and 200,000 IMDb ratings. What is it? What series does it belong to?
To answer this question, we need to use 3 data sets, title ratings,title basics and title episodes.
# create df with list of all TV episodeslist_tv_epis <- title_basics |>filter(titleType =="tvEpisode") |>select(tconst, titleType, primaryTitle)# create df with list of all TV serieslist_tv_series <- title_basics |>filter(titleType =="tvSeries") |>select(tconst, titleType, primaryTitle)# create df with records of tv episodestv_ep_df1 <-inner_join(list_tv_epis, title_episodes, by ="tconst")# join ratings datatv_ep_df2 <-inner_join(tv_ep_df1, title_ratings, by ="tconst")# find a TV episode meeting criteriatv_ep_df3 <- tv_ep_df2 |>filter((numVotes >=200000) & (averageRating ==10))# map tv series nametv_ep_ratings_df <-inner_join(tv_ep_df3, list_tv_series, by =c("parentTconst"="tconst"))# rename columns in the resulting dftv_ep_ratings_df |>rename(episode_id = tconst,average_rating = averageRating,number_of_ratings = numVotes,title_type = titleType.x,episode_title = primaryTitle.x,series_id = parentTconst,series_name = primaryTitle.y,parent_title_type = titleType.y ) |>gt()
episode_id
title_type
episode_title
series_id
season_number
episode_number
average_rating
number_of_ratings
parent_title_type
series_name
tt2301451
tvEpisode
Ozymandias
tt0903747
5
14
10
227589
tvSeries
Breaking Bad
The TV episode with the perfect 10/10 rating and over 200K reviews is Ozymandias ep.15 season 5 of the cult TV hit Breaking Bad.
Q4. What four projects is the actor Mark Hamill most known for?
To answer this question, we will use name basics and title basics data sets.
# get title records for mark hamillmh_df <- name_basics |>filter(primaryName =="Mark Hamill") |>select(primaryName, knownForTitles) |>separate_longer_delim(knownForTitles, ",")# map titles names and types on the list of selected content IDsmh_df2 <-inner_join(mh_df, title_basics, by =c("knownForTitles"="tconst"))mh_df2 |>select(knownForTitles, titleType, primaryTitle, start_year) |>rename(title_id = knownForTitles,content_type = titleType,content_title = primaryTitle,year = start_year ) |>gt() |>tab_header(title ="Titles Mark Hamill is Known For" )
Titles Mark Hamill is Known For
title_id
content_type
content_title
year
tt0076759
movie
Star Wars: Episode IV - A New Hope
1977
tt2527336
movie
Star Wars: Episode VIII - The Last Jedi
2017
tt0080684
movie
Star Wars: Episode V - The Empire Strikes Back
1980
tt0086190
movie
Star Wars: Episode VI - Return of the Jedi
1983
Mark Hamill is known for his roles in the Star Wars movies, where he first starred in 1977 and most recently in 2017.
Q5. What TV series, with more than 12 episodes, has the highest average rating?
To answer this question, we need 3 data sets - title_episodes, title ratings and title basics.
# we already have a df with all TV series - we created it in a previous question - list_tv_series# create a df with records of tv series wirh all episodesep_filtered_series <-inner_join(title_episodes, list_tv_series, by =c("parentTconst"="tconst"))# df with tv series with 12+ episodesseries_num_epis <- ep_filtered_series |>group_by(parentTconst, primaryTitle, titleType) |>summarise(number_of_episodes =n()) |>ungroup() |>arrange(desc(number_of_episodes)) |>filter(number_of_episodes >=12) datatable(series_num_epis)
We have over 20K TV series with 12 or more episodes.
# join tv episodes and series data with ratings dataep_filtered_series_ratings <-inner_join(ep_filtered_series, title_ratings,by ="tconst")# drop all tv series with fewer than 12 episodesep_filtered_series_ratings2 <-inner_join(ep_filtered_series_ratings, series_num_epis,by ="parentTconst")# calculate average ratings for tv seriesep_filtered_series_ratings2 |>group_by(parentTconst, primaryTitle.x) |>summarise(average_rating =mean(averageRating)) |>ungroup() |>rename(tv_series_id = parentTconst,tv_series_title = primaryTitle.x ) |>arrange(desc(average_rating)) |>head(5) |>gt() |>tab_header(title ="Top 5 TV Series by Average Rating",subtitle ="TV series with 12 or more episodes only" )
Top 5 TV Series by Average Rating
TV series with 12 or more episodes only
tv_series_id
tv_series_title
average_rating
tt0409579
Made
10.0
tt11363282
The Real Housewives of Salt Lake City
10.0
tt21278628
Cowboys of Thunder
10.0
tt0060008
The Milton Berle Show
9.9
tt0168358
Parkinson
9.9
There are 3 TV series that obtained the perfect 10/10 rating - ‘Made’,‘The Real Housewives of Salt Lake City’ and ‘Cowboys of Thunder’.
Q6. Is it true that episodes from later seasons of Happy Days have lower average ratings than the early seasons?
To answer this question, we will use title basics,title episodes and title ratings data sets:
# create df for TV series 'Happy Days'hd_df1 <- title_basics |>filter(primaryTitle =="Happy Days"& titleType =="tvSeries")# join HD df with detailed TV episodes datahd_detail <-inner_join(title_episodes, hd_df1, by =c("parentTconst"="tconst"))# join ratings data to detailed Happy Days recordshd_detail_ratings <-inner_join(hd_detail, title_ratings, by ="tconst")datatable(hd_detail_ratings)
Now that we have detailed records on all episodes of the Happy Days TV series, we can calculate the average rating for each season.
It appears that the earlier seasons of the series indeed had higher average ratings compared to the more recent seasons.
# create df with average rating by seasonavg_hd_detail_ratings <- hd_detail_ratings |>group_by(season_number) |>summarise(avg_rating_season =mean(averageRating)) |>ungroup() |>arrange(season_number)# plot the resulting dffig_hd_seasons <-plot_ly(data = avg_hd_detail_ratings,x =~season_number,y =~avg_rating_season,type ="bar",marker =list(color ="cerulean"),width =500,height =300)fig_hd_seasons <- fig_hd_seasons |>layout(title ="Happy Days - Average Rating by Season",xaxis =list(title ="Season #"),yaxis =list(title ="Average Rating") )fig_hd_seasons
(Please note it’s an interactive chart - hover over it for interaction options.)
Task 3
Design a ‘success’ measure for IMDb entries, reflecting both quality and broad popular awareness.
As we found in Q1 in Task1, movies constitute the absolute majority of records in our data - 131.6K records vs 29.8K for TV series, the next largest category of content. We do not include TV episode in this analysis as TV episodes are not a standalone content. Given the obvious differences in production, marketing, and audience appeal, we will focus on movies for this part of the exercise.
# plot number of records by content type from the earlier questionfig_content_count_type
(Please note it’s an interactive chart - hover over it for interaction options.)
Let’s start with creating a data frame with ratings data for movies.
# create df with list of all movieslist_movies <- title_basics |>filter(titleType =="movie") |>select(tconst, titleType, primaryTitle, start_year, genres, runtime_minutes, isAdult)# join with ratings datamovie_ratings_df <-inner_join(list_movies, title_ratings, by ="tconst")movie_ratings_df2 <- movie_ratings_df |>rename(title = primaryTitle,title_id = tconst,content_type = titleType,year = start_year,average_rating = averageRating,number_of_votes = numVotes )# sample movie dfsample_n(movie_ratings_df2, 1000) |> DT::datatable()
Next we will conduct an explanatory data analysis on our movies data set to better understand the two ratings metrics.
average_rating number_of_votes
Min. : 1.000 Min. : 100
1st Qu.: 5.200 1st Qu.: 195
Median : 6.100 Median : 459
Mean : 5.923 Mean : 8694
3rd Qu.: 6.800 3rd Qu.: 1664
Max. :10.000 Max. :2942823
# histogram of average ratings# plot a histogram of number of ratings in plotlyavg_ratings_x <- movie_ratings_df2$average_ratingfig_hist_avg_ratings <-plot_ly(x = avg_ratings_x,type ="histogram",nbinsx =100,marker =list(color ="cerulean")) |>layout(title ="Distribution of Average Movie Ratings",xaxis =list(title ="Average Rating"),yaxis =list(title ="Frequency") )fig_hist_avg_ratings
(Please note it’s an interactive chart - hover over it for interaction options.)
# histogram of average ratings# plot a histogram of number of ratings in plotlynum_ratings_x <- movie_ratings_df2$number_of_votesfig_distr_number_ratings <-plot_ly(x = num_ratings_x,type ="histogram",nbinsx =80,marker =list(color ="cerulean")) |>layout(title ="Distribution of Movie Ratings",xaxis =list(title ="Number of Ratings"),yaxis =list(type ="log", title ="Frequency (Log-Scaled)") )fig_distr_number_ratings
(Please note it’s an interactive chart - hover over it for interaction options.)
Looking at descriptive statistics and statistical plots, we can see that most titles have relatively high average ratings. 50% of all titles have a rating above 6.1, and top 25% of titles have a rating over 6.8. Distribution of number of ratings, on the other hand, has a right skew, meaning that we have only a handful of titles with a very high number of votes.
Since we need to design a blended performance metric, we need to account for quality and popularity of a title simultaneously which can be done by an averaging of these two metrics. Before we proceed, we need to standardize the data to account for differences in magnitude and distribution of ratings and votes variables:
# calculate mean and standard deviation for ratings and votes datamovie_ratings_df3 <- movie_ratings_df2 |>mutate(avg_ratings_movies =mean(average_rating),avg_number_ratings =mean(number_of_votes),sd_avg_ratings =sd(average_rating),sd_number_ratings =sd(number_of_votes) )sample_n(movie_ratings_df3, 1000) |> DT::datatable(options =list(pageLength =5 ))
Now we can create standardized metrics for ratings and votes, as well as the blended performance index reflecting the quality of the movie (via average rating) and the popularity of the movie (via number of ratings), with equal weight given to each input.
# descriptive statistics for performance indexmovie_ratings_df4_pi <- movie_ratings_df4 |>select(performance_index)summary(movie_ratings_df4_pi)
performance_index
Min. :-1.990000
1st Qu.:-0.360000
Median : 0.030000
Mean : 0.000264
3rd Qu.: 0.310000
Max. :27.490000
# histogram of performance indexpi_x2 <- movie_ratings_df4$performance_indexfig7 <-plot_ly(x = pi_x2,type ="histogram",nbinsx =200,marker =list(color ="blue")) |>layout(title ="Distribution of Movie Performance Indices",xaxis =list(title ="Performance Index"),yaxis =list(title ="Frequency") )fig7
# % of titles with negative PImovie_ratings_df4_pi |>summarise(titles_with_negative_pi =sum(performance_index <0),all_titles =n() ) |>mutate(share_of_titles_with_negative_pi =round(titles_with_negative_pi / all_titles, 2)) |>gt()
titles_with_negative_pi
all_titles
share_of_titles_with_negative_pi
61673
131662
0.47
Performance index penalizes titles with subpar, i.e., below average, popularity and/or quality. 47% of movies in our data set have negative performance index.
Performance Index Validation
1.Choose the top 5-10 movies on your metric and confirm that they were indeed box office successes.
# top 5 moviesmrdf <- movie_ratings_df4 |>select(title, year, genres, average_rating, number_of_votes, performance_index)mrdf |>arrange(performance_index) |>slice_max(performance_index, n =5) |>gt() |>tab_header(title ="Top 5 Movies by Peformance Index" )
Top 5 Movies by Peformance Index
title
year
genres
average_rating
number_of_votes
performance_index
The Shawshank Redemption
1994
Drama
9.3
2942823
27.49
The Dark Knight
2008
Action,Crime,Drama
9.0
2922922
27.20
Inception
2010
Action,Adventure,Sci-Fi
8.8
2595555
24.20
Fight Club
1999
Drama
8.8
2374722
22.23
Forrest Gump
1994
Drama,Romance
8.8
2301630
21.57
Among top 5 movies based on performance index, four (with the exception of The Shawshank Redemption) were commercial successes, and The Shawshank Redemption is still widely considered to be one of the beloved and most critically acclaimed movies of all times.
Choose 3-5 movies with large numbers of IMDb votes that score poorly on your success metric and confirm that they are indeed of low quality.
# add this to top line to change plot size: , fig.width=4,fig.height=4}# plot ratings and votes datagfig <-ggplot(data = movie_ratings_df2, aes(x = average_rating, y = number_of_votes)) +geom_point(size =1, color ="blue") +labs(title ="Movie Quality (Average Rating) and Popularity (Number of Ratings)",x ="Average Rating",y ="Number of Ratings" ) +theme_minimal() +theme_bw() +scale_x_log10(label = scales::comma) +scale_y_log10(label = scales::comma) gfig
As seen on this chart, we should have a decent number of movies with average rating of 1-2 and 80K-100K number of ratings, so we will look up titles meeting these criteria:
Indeed, these movies score very poorly on the performance index, and while they have a relatively large volume of ratings, they also have low average ratings.
Choose a prestige actor or director and confirm that they have many projects with high scores on your success metric.
Steven Spielberg, one of the most famous and successful directors of our time, has 4 very successful projects with performance index of ranging from 4.65 to 14.54, which puts these titles in top 1% of all movies in our data set.
# get title records for Steven Spielbergbp_df <- name_basics |>filter(primaryName =="Steven Spielberg") |>select(primaryName, knownForTitles) |>separate_longer_delim(knownForTitles, ",")# map titles names and types on the list of selected content IDsbp_df2 <-inner_join(bp_df, title_basics, by =c("knownForTitles"="tconst"))bp_df3 <- bp_df2 |>select(primaryName, knownForTitles, titleType, primaryTitle) |>rename(name = primaryName,title_id = knownForTitles,content_type = titleType,content_title = primaryTitle )#select performance index and titlemovie_pi_df <- movie_ratings_df4 |>select(title_id, average_rating, number_of_votes, performance_index)# join to SS recordsbp_df4 <-inner_join(bp_df3, movie_pi_df, by ="title_id")datatable(bp_df4)
# percentiles for performance indexquantile(movie_ratings_df4$performance_index, probs =c(0,0.125,0.375,0.625,0.875,0.9,0.95,0.99,1))
We will use 0.38 (top 20% score cutoff) as a threshold of success - titles with performance index of 0.38 or higher are high performers.
Task 4: Trends in Success Over Time
We need to review our records in the context of distribution of titles by decade and genre.
Due to a low volume of production and a stable share of successful productions over time, we can exclude data prior to 1970.
# add new columns for decade and success movie_ratings_df4<-movie_ratings_df4 |>mutate(decade=floor(year/10)*10,success_flag=case_when( performance_index>0.38~1, performance_index<=0.38~0) )movie_ratings_df4_agg_decade<-movie_ratings_df4 |>select (title_id, title,genres,decade,year,performance_index,success_flag) |>group_by(decade) |>summarise(number_of_titles=n(),number_of_successes=sum(success_flag==1),number_of_flops=sum(success_flag==0)) |>ungroup()# plot the resulting dffig_decade <-plot_ly(data = movie_ratings_df4_agg_decade,x =~decade,y =~number_of_successes,type ="bar",name='number of successes',# marker = list(color = "blue"),width =500,height =300) |>add_trace(y=~number_of_flops, name='number of flops') fig_decade <- fig_decade |>layout(title ="Titles by Decade",xaxis =list(title ="Decade"),yaxis =list(title ="Count of Titles"),barmode='stack' )fig_decade
(Please note it’s an interactive chart - hover over it for interactive options.)
#create a df for genresmovie_ratings_df4_agg_genres<-movie_ratings_df4 |>select (title_id, title,genres,decade,year,performance_index,success_flag) |>group_by(genres) |>summarise(number_of_titles=n(),number_of_successes=sum(success_flag==1),number_of_flops=sum(success_flag==0)) |>arrange(desc(number_of_titles)) |>ungroup()# plot the resulting dffig_genres <-plot_ly(data = movie_ratings_df4_agg_genres,x =~reorder(genres,-number_of_titles),y =~number_of_titles,type ="bar",marker =list(color ="blue"))fig_genres <- fig_genres |>layout(title ="Count of Titles by Genre",xaxis =list(title ="Genres",tickangle =-45,tickfont =list(size =8)),yaxis =list(title ="Count of Titles") )fig_genres
(Please note it’s an interactive chart - hover over it for interactive options.)
We have a very large number of genres with only a handful of titles, so we can exclude these records from our data set to ensure our analysis is as robust as possible.
#top 20 genres by count of titles movie_ratings_df4_top20_genres<-movie_ratings_df4_agg_genres |>slice_max(number_of_titles,n=20)# subset data by decade and aggregate count successes and flopsmovie_ratings_df4_decade_genres<-movie_ratings_df4 |>filter(year>=1970) |>select (title_id, title,genres,decade,year,success_flag) |>group_by(genres,decade) |>summarise(number_of_titles=n(),number_of_successes=sum(success_flag==1),number_of_flops=sum(success_flag==0)) |>mutate(percent_of_success=round(number_of_successes/number_of_titles,2)) |>ungroup() movie_ratings_df4_decade_genres_filtered<-inner_join(movie_ratings_df4_decade_genres, movie_ratings_df4_top20_genres,by="genres") |>select(genres,decade,number_of_titles.x,number_of_successes.x,number_of_flops.x,percent_of_success) |>rename(number_of_titles=number_of_titles.x,number_of_successes=number_of_successes.x,number_of_flops=number_of_flops.x )datatable(movie_ratings_df4_decade_genres_filtered)
1.What was the genre with the most “successes” in each decade?
Drama produced more successes than other genres in 1970s (292 titles), 1980s (341 titles),and 1990s (334 titles). Starting in 2000s, Documentary took over with 747 successes in 2000s, 1290 successes in 2010, and 593 successful titles in 2020.
#subset successes by genre and decademovie_ratings_df4_decade_genres_filtered_successes_pw<-pivot_wider( movie_ratings_df4_decade_genres_filtered,id_cols = genres,names_from =decade ,values_from = number_of_successes) datatable(movie_ratings_df4_decade_genres_filtered_successes_pw)
What genre consistently has the most “successes”?
Drama and documentary collectively produced more successes than other genres (2915 and 2868, respectively), with Documentary emerging as a leading genre in recent decades (2000- present).
gp1<-ggplot(movie_ratings_df4_decade_genres_filtered,aes(x = decade, y = number_of_successes) ) +geom_col(fill='green4') +labs(title ="Successful Productions by Genre",x ="Decade", y ="Successful Productions") +geom_text(aes(label = number_of_successes), position =position_stack(vjust =2), # Place labels outside the barssize =2) +facet_grid(. ~ genres)+facet_wrap(~genres,ncol=4, strip.position ="top")+theme_minimal() +theme (axis.text.x =element_text(size =8), axis.text.y =element_text(size =8), axis.title.x =element_text(size =10), axis.title.y =element_text(size =10),plot.title =element_text(hjust =0.5, size =14) )gp1
What genre has produced the most “successes” since 2010? Does it have the highest success rate or does it only have a large number of successes because there are many productions in that genre?
# create a custom color palettepalette_genres3<-c("dodgerblue2", "#E31A1C", "green4","#6A3D9A","#FF7F00","black", "yellow","skyblue2", "#FB9A99","palegreen2","#CAB2D6","#FDBF6F","gray70", "khaki2", "maroon", "orchid1", "deeppink1", "blue1", "steelblue4","darkturquoise","green1", "yellow4", "yellow3","darkorange4", "brown")#create a chartfig_top20g <-plot_ly(movie_ratings_df4_decade_genres_filtered,x =~decade, y =~percent_of_success, color =~genres,type ='scatter',mode ='lines',colors = palette_genres3 )|>layout(title ="Top 20 Genres - Percent of Success by Decade",xaxis =list(title =""),yaxis =list(title ="Percent of Success",tickformat =".0%",range =c(0, 1) ),legend =list(orientation ='h', # Horizontal legend x =0.5, # Center horizontallyxanchor ='center', # Align centery =-0.2, # Position below the plotfont =list(size =8 ) # Smaller font size ) )# Show the plotfig_top20g
(Please note it’s an interactive chart - hover over it for interactive options.)
Documentary has produced most successful titles since 2010 (1883 titles) and it has the best success rate of all genres..
4.What genre has become more popular in recent years?
There has been a spike in success rate for Action genre, going from 5% in 2010s to 21% in 2020s.
Based on success rate, documentary is a clear standout and should be prioritized for investment opportunity.
Task 5: Key Personnel
Identify (at least) two actors and one director who you will target as the key talent for your movie. Write a short “pitch” as to why they are likely to be successful. You should support your pitch with at least one graphic and one table.
Since we are going to be developing a documentary title, we need to adjust this question a bit and identify a director-writer team as opposed to a director-actors team.
#get a list of titles in documentary genre, made after 1970 , with sufficient level of awareness and high performance index and map director and writer infodoc_df1<-sqldf(" with a as( select title_id, title, decade, performance_index, average_rating, number_of_votes, success_flag from movie_ratings_df4 where 1=1 and genres='Documentary' and year>=1970 and success_flag=1and number_of_votes>=5000 ) select a.*, t.directors, t.writers, n.primaryName as director_name, n2.primaryName as writer_name, tb.start_year from a inner join title_crew t on a.title_id=t.tconst inner join name_basics n on directors=n.nconst inner join name_basics n2 on t.writers=n2.nconst inner join title_basics tb on a.title_id=tb.tconst order by a.performance_index desc, a.number_of_votes desc, a.average_rating desc ; " ) # calculate performance statistics for director-writer teamsdoc_df2<-sqldf("select director_name,writer_name,director_name||'-'||writer_name as movie_team,count(title_id) as cnt_movies,avg(performance_index) as avg_performance_index,avg(average_rating) as avg_rating,avg(number_of_votes) as avg_number_of_ratingsfrom doc_df1group by 1,2,3having count(title_id)>1order by 4 desc ; " )datatable(doc_df2)
Looking at the high-performing documentaries from 1970s - present, 3 film makers have produced multiple successful titles: Werner Herzog, Michael Moore and the director-writer duo of Sophie Fiennes and Slavoj Zizek. Since we need to identify a team for our next project, we propose to approach the Fiennes-Zizek duo as they have already demonstrated they can successfully work together, which might not be the case for established solo creators Moore and Herzog.
# plot movie team datafig_movie_team<-plot_ly(data=doc_df2,x =~avg_rating,y =~avg_number_of_ratings,type ='scatter',mode ='markers',color=~movie_team ) |>layout(title ="Movie Team Performance Comparison",xaxis =list(title ="Quality (Average Rating)"),yaxis =list(title ="Popularity (Number of Ratings)"),legend =list(orientation ='h', # Horizontal legend x =0.5, # Center horizontallyxanchor ='center', # Align centery =-0.2 ) )fig_movie_team
Titles produced by Moore and Herzog appear to have a higher awareness among viewers but Fiennes-Zizek work is not far behind, and a more polarizing topic and a targeted marketing and PR campaign can help address this slight shortcoming.
Task 6: Finding a Classic Movie to Remake
Find a classic movie to remake with your key talent. The original should have a large number of IMDb ratings, a high average rating, and not have been remade in the past 25 years.
When looking at the top documentary titles, Super Size Me is a definite outlier: Super Size Me premiered at the 2004 Sundance Film Festival, where Morgan Spurlock won the Grand Jury Prize for directing the film.The film opened in the US on May 7, 2004, and grossed a total of $11,536,423 worldwide, making it the 7th highest-grossing documentary film of all time.It was nominated for an Academy Award for Best Documentary Feature and won the award for Best Documentary Screenplay from the Writers Guild of America. (Source).
A 2017 title Super Size Me 2: Holy Chicken! from the same director also performed reasonably well, even in the light of certain issues with with publicity and distribution. It’s important to note that this film was not a remake of a original title as it was focused on the process of opening a fast-food restaurant. (Source)
Given the success of the 2004 ‘Super Size Me’ and increasing popularity of the semaglutide drugs, we should consider making a documentary about a weight loss journey and impact of taking this medicine on one’s life, health and mind - a ‘Super Size Me’ journey in reverse. While this movie was released 20 years ago, cultural context, relevancy and timeliness play a huge role in documentary titles success, and for this topic the time is definitely now. Another reason to pursue this opportunity now is an unhappy one as Morgan Spurlock, the writer and director of both ‘Super Size Me’ titles, died in May of this year so re imagining his most famous work could serve as a tribute to Spurlock’s many talents and the impact his vision and creative genius left on our society. As a possible contributor to our project, we can consider Lee Fulkerson, who wrote and directed an award-winning and highly acclaimed documentary Forks Over Knives as he has already successfully explored the topic of self-improvement in his 2011 movie (performance index of 0.73).
Task 7: Write and Deliver Your Pitch
From Sophie Fiennes and Slavoj Zizek, the masters of philosophical and psychoanalytical exploration, and Lee Fulkerson, the visionary mind behind an inspiring story of human transformation, inspired by a critically acclaimed hit Super Size Me, comes the modern take on a timeless tale of metamorphosis, obsession and desire to be perfect at any cost. XXS Me: The Beginning coming to Netflix in December 2025.